{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Build RAG with Hugging Face and Milvus\n",
    "\n",
    "_Authored by: [Chen Zhang](https://github.com/zc277584121)_\n",
    "\n",
    "\n",
    "[Milvus](https://milvus.io/) is a popular open-source vector database that powers AI applications with highly performant and scalable vector similarity search. In this tutorial, we will show you how to build a RAG (Retrieval-Augmented Generation) pipeline with Hugging Face and Milvus.\n",
    "\n",
    "The RAG system combines a retrieval system with an LLM. The system first retrieves relevant documents from a corpus using Milvus vector database, then uses an LLM hosted in Hugging Face to generate answers based on the retrieved documents."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## Preparation\n",
    "### Dependencies and Environment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "outputs": [],
   "source": [
    "! pip install --upgrade pymilvus sentence-transformers huggingface-hub langchain_community langchain-text-splitters pypdf tqdm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "> If you are using Google Colab, to enable the dependencies, you may need to **restart the runtime** (click on the \"Runtime\" menu at the top of the screen, and select \"Restart session\" from the dropdown menu)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "In addition, we recommend that you configure your [Hugging Face User Access Token](https://huggingface.co/docs/hub/security-tokens), and set it in your environment variables because we will use a LLM from the Hugging Face Hub. You may get a low limit of requests if you don't set the token environment variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"HF_TOKEN\"] = \"hf_...\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prepare the data\n",
    "\n",
    "We use the [AI Act PDF](https://artificialintelligenceact.eu/wp-content/uploads/2021/08/The-AI-Act.pdf), a regulatory framework for AI with different risk levels corresponding to more or less regulation, as the private knowledge in our RAG."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "if [ ! -f \"The-AI-Act.pdf\" ]; then\n",
    "    wget -q https://artificialintelligenceact.eu/wp-content/uploads/2021/08/The-AI-Act.pdf\n",
    "fi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use the [`PyPDFLoader`](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/pdf/) from LangChain to extract the text from the PDF, and then split the text into smaller chunks. By default, we set the chunk size as 1000 and the overlap as 200, which means each chunk will nearly have 1000 characters and the overlap between two chunks will be 200 characters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "108\n"
     ]
    }
   ],
   "source": [
    "from langchain_community.document_loaders import PyPDFLoader\n",
    "\n",
    "loader = PyPDFLoader(\"The-AI-Act.pdf\")\n",
    "docs = loader.load()\n",
    "print(len(docs))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)\n",
    "chunks = text_splitter.split_documents(docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "text_lines = [chunk.page_content for chunk in chunks]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prepare the Embedding Model\n",
    "Define a function to generate text embeddings. We use [BGE embedding model](https://huggingface.co/BAAI/bge-small-en-v1.5) as an example, but you can use any embedding models, such as those found on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sentence_transformers import SentenceTransformer\n",
    "\n",
    "embedding_model = SentenceTransformer(\"BAAI/bge-small-en-v1.5\")\n",
    "\n",
    "def emb_text(text):\n",
    "    return embedding_model.encode([text], normalize_embeddings=True).tolist()[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Generate a test embedding and print its dimension and first few elements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "384\n",
      "[-0.07660683244466782, 0.025316666811704636, 0.012505513615906239, 0.004595153499394655, 0.025780051946640015, 0.03816710412502289, 0.08050819486379623, 0.003035430097952485, 0.02439221926033497, 0.0048803347162902355]\n"
     ]
    }
   ],
   "source": [
    "test_embedding = emb_text(\"This is a test\")\n",
    "embedding_dim = len(test_embedding)\n",
    "print(embedding_dim)\n",
    "print(test_embedding[:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load data into Milvus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create the Collection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pymilvus import MilvusClient\n",
    "\n",
    "milvus_client = MilvusClient(uri=\"./hf_milvus_demo.db\")\n",
    "\n",
    "collection_name = \"rag_collection\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "> As for the argument of `MilvusClient`:\n",
    "> - Setting the `uri` as a local file, e.g.`./hf_milvus_demo.db`, is the most convenient method, as it automatically utilizes [Milvus Lite](https://milvus.io/docs/milvus_lite.md) to store all data in this file.\n",
    "> - If you have a large amount of data, say more than a million vectors, you can set up a more performant Milvus server on [Docker or Kubernetes](https://milvus.io/docs/quickstart.md). In this setup, please use the server uri, e.g.`http://localhost:19530`, as your `uri`.\n",
    "> - If you want to use [Zilliz Cloud](https://zilliz.com/cloud), the fully managed cloud service for Milvus, adjust the `uri` and `token`, which correspond to the [Public Endpoint and Api key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#cluster-details) in Zilliz Cloud.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check if the collection already exists and drop it if it does."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "if milvus_client.has_collection(collection_name):\n",
    "    milvus_client.drop_collection(collection_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a new collection with specified parameters. \n",
    "\n",
    "If we don't specify any field information, Milvus will automatically create a default `id` field for primary key, and a `vector` field to store the vector data. A reserved JSON field is used to store non-schema-defined fields and their values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "milvus_client.create_collection(\n",
    "    collection_name=collection_name,\n",
    "    dimension=embedding_dim,\n",
    "    metric_type=\"IP\",  # Inner product distance\n",
    "    consistency_level=\"Strong\",  # Strong consistency level\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Insert data\n",
    "Iterate through the text lines, create embeddings, and then insert the data into Milvus.\n",
    "\n",
    "Here is a new field `text`, which is a non-defined field in the collection schema. It will be automatically added to the reserved JSON dynamic field, which can be treated as a normal field at a high level."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Creating embeddings: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 429/429 [00:52<00:00,  8.10it/s]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "429"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from tqdm import tqdm\n",
    "\n",
    "data = []\n",
    "\n",
    "for i, line in enumerate(tqdm(text_lines, desc=\"Creating embeddings\")):\n",
    "    data.append({\"id\": i, \"vector\": emb_text(line), \"text\": line})\n",
    "\n",
    "insert_res = milvus_client.insert(collection_name=collection_name, data=data)\n",
    "insert_res[\"insert_count\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build RAG\n",
    "\n",
    "### Retrieve data for a query\n",
    "\n",
    "Let's specify a question to ask about the corpus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "question = \"What is the legal basis for the proposal?\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Search for the question in the collection and retrieve the top 3 semantic matches."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "search_res = milvus_client.search(\n",
    "    collection_name=collection_name,\n",
    "    data=[\n",
    "        emb_text(question)\n",
    "    ],  # Use the `emb_text` function to convert the question to an embedding vector\n",
    "    limit=3,  # Return top 3 results\n",
    "    search_params={\"metric_type\": \"IP\", \"params\": {}},  # Inner product distance\n",
    "    output_fields=[\"text\"],  # Return the text field\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a look at the search results of the query\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\n",
      "    [\n",
      "        \"EN 6  EN 2. LEGAL  BASIS,  SUBSIDIARITY  AND  PROPORTIONALITY  \\n2.1. Legal  basis  \\nThe legal basis for the proposal is in the first place Article 114 of the Treaty on the \\nFunctioning of the European Union (TFEU), which provides for the adoption of measures to \\nensure the establishment and f unctioning of the internal market.  \\nThis proposal constitutes a core part of the EU digital single market strategy. The primary \\nobjective of this proposal is to ensure the proper functioning of the internal market by setting \\nharmonised rules in particular on the development, placing on the Union market and the use \\nof products and services making use of AI technologies or provided as stand -alone AI \\nsystems. Some Member States are already considering national rules to ensure that AI is safe \\nand is developed a nd used in compliance with fundamental rights obligations. This will likely \\nlead to two main problems: i) a fragmentation of the internal market on essential elements\",\n",
      "        0.7412998080253601\n",
      "    ],\n",
      "    [\n",
      "        \"applications and prevent market fragmentation.  \\nTo achieve those objectives, this proposal presents a balanced and proportionate horizontal \\nregulatory approach to AI that is limited to the minimum necessary requirements to address \\nthe risks and problems linked to AI, withou t unduly constraining or hindering technological \\ndevelopment or otherwise disproportionately increasing the cost of placing AI solutions on \\nthe market.  The proposal sets a robust and flexible legal framework. On the one hand, it is \\ncomprehensive and future -proof in its fundamental regulatory choices, including the \\nprinciple -based requirements that AI systems should comply with. On the other hand, it puts \\nin place a proportionate regulatory system centred on a well -defined risk -based regulatory \\napproach that  does not create unnecessary restrictions to trade, whereby legal intervention is \\ntailored to those concrete situations where there is a justified cause for concern or where such\",\n",
      "        0.696428656578064\n",
      "    ],\n",
      "    [\n",
      "        \"approach that  does not create unnecessary restrictions to trade, whereby legal intervention is \\ntailored to those concrete situations where there is a justified cause for concern or where such \\nconcern can reasonably be anticipated in the near future. At the same time, t he legal \\nframework includes flexible mechanisms that enable it to be dynamically adapted as the \\ntechnology evolves and new concerning situations emerge.  \\nThe proposal sets harmonised rules for the development, placement on the market and use of \\nAI systems i n the Union following a proportionate risk -based approach. It proposes a single \\nfuture -proof definition of AI. Certain particularly harmful AI practices are prohibited as \\ncontravening Union values, while specific restrictions and safeguards are proposed in  relation \\nto certain uses of remote biometric identification systems for the purpose of law enforcement. \\nThe proposal lays down a solid risk methodology to define \\u201chigh -risk\\u201d AI systems that pose\",\n",
      "        0.6891457438468933\n",
      "    ]\n",
      "]\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "\n",
    "retrieved_lines_with_distances = [\n",
    "    (res[\"entity\"][\"text\"], res[\"distance\"]) for res in search_res[0]\n",
    "]\n",
    "print(json.dumps(retrieved_lines_with_distances, indent=4))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Use LLM to get an RAG response\n",
    "\n",
    "Before composing the prompt for LLM, let's first flatten the retrieved document list into a plain string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "context = \"\\n\".join(\n",
    "    [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define prompts for the Language Model. This prompt is assembled with the retrieved documents from Milvus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "PROMPT = \"\"\"\n",
    "Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.\n",
    "<context>\n",
    "{context}\n",
    "</context>\n",
    "<question>\n",
    "{question}\n",
    "</question>\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use the [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) hosted on Hugging Face inference server to generate a response based on the prompt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from huggingface_hub import InferenceClient\n",
    "\n",
    "repo_id = \"mistralai/Mixtral-8x7B-Instruct-v0.1\"\n",
    "\n",
    "llm_client = InferenceClient(model=repo_id, timeout=120)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "Finally, we can format the prompt and generate the answer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "prompt = PROMPT.format(context=context, question=question)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The legal basis for the proposal is Article 114 of the Treaty on the Functioning of the European Union (TFEU), which provides for the adoption of measures to ensure the establishment and functioning of the internal market. The proposal aims to establish harmonized rules for the development, placing on the market, and use of AI systems in the Union following a proportionate risk-based approach.\n"
     ]
    }
   ],
   "source": [
    "answer = llm_client.text_generation(\n",
    "    prompt,\n",
    "    max_new_tokens=1000,\n",
    ").strip()\n",
    "print(answer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "Congratulations! You have built an RAG pipeline with Hugging Face and Milvus."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
